Abstract

'Distribution regression' refers to the situation where a
response $Y$ depends on a covariate $P$, where $P$ is a
probability distribution. The model is $Y = f(P) + \mu$, where
$f$ is an unknown regression function and $\mu$ is a random
error. Typically, we do not observe $P$ directly, but rather, we
observe a sample from $P$. In this paper we develop theory and
methods for distribution-free versions of distribution
regression. This means that we do not make distributional
assumptions about the error term $\mu$ and covariate $P$. We
prove that when the effective dimension is small enough (as
measured by the doubling dimension), then the excess prediction
risk converges to zero with a polynomial rate.



1 Introduction 

In a standard regression model, we need to predict a
real-valued response $Y$ from a vector-valued covariate (or
feature) $X \in \mathbb{R}^d$. Recently, there has been interest
in extensions of standard regression from finite dimensional
Euclidean spaces to other domains. For example, in functional
regression (Ferraty and Vieu, 2006) the covariate is a function
instead of a finite dimensional vector.

In this paper, we study distribution regression where the
covariate is a probability distribution $P$. This differs from
functional regression in two important ways. First, $P$ is a
probability measure on $\mathbb{R}^k$ rather than a
one-dimensional function. Second, and more importantly, we do
not observe the covariate $P$ directly. Rather, we observe a
sample from $P$, which means that we have a regression model
with measurement error (Carroll et al., 2006; Fan and Truong,
1993).

The formal definition of the problem is as follows. We consider
a regression problem with variables $(P_1, Y_1), \ldots, (P_m,
Y_m)$ where $Y_i \in \mathbb{R}$ and each $P_i$ is a probability
distribution on a compact subset $\mathcal{K} \subset
\mathbb{R}^k$. We assume that

$$ Y_i = f(P_i) + \mu_i, \qquad i = 1, \ldots, m, \tag{1} $$

for some functional $f$, where $\mu_i$ is a noise variable with
mean 0. We do not observe $P_i$ directly; rather we observe a
sample

$$ X_{i1}, \ldots, X_{i n_i} \overset{\text{i.i.d.}}{\sim} P_i. \tag{2} $$

Thus the observed data are $\{(\mathcal{X}_i, Y_i)\}_{i=1}^m$,
where $\mathcal{X}_i = \{X_{i1}, \ldots, X_{i n_i}\}$. Our goal
is to predict a new $Y_{m+1}$ from a new batch
$\mathcal{X}_{m+1}$ drawn from a new distribution $P_{m+1}$.
This model is illustrated in Figure 1.

Figure 1: Illustration of the model. The distributions $P_1,
\ldots, P_m, P_{m+1}$ are unobserved; only the sample sets
$\mathcal{X}_1, \ldots, \mathcal{X}_m, \mathcal{X}_{m+1}$ are
observable.
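The data-generating process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `make_dataset`, `draw_param` and `draw_points` are hypothetical, each distribution is identified by a scalar parameter, and the noise is taken to be uniform on a symmetric interval so that it has mean zero and is bounded.

```python
import random

def make_dataset(m, n, f, draw_param, draw_points, noise=0.0, seed=0):
    """Draw m pairs (sample set, response) with Y_i = f(P_i) + mu_i.

    draw_param(rng)         -> parameter theta_i identifying the hidden P_i
    draw_points(rng, th, n) -> n i.i.d. points from P_i (the observed set)
    f(th)                   -> value of the regression functional at P_i
    noise                   -> half-width of the uniform, mean-zero noise
    """
    rng = random.Random(seed)
    data = []
    for _ in range(m):
        theta = draw_param(rng)                  # P_i itself is never observed
        sample_set = draw_points(rng, theta, n)  # what the learner sees
        mu = rng.uniform(-noise, noise)          # bounded noise with mean 0
        data.append((sample_set, f(theta) + mu))
    return data

# Example: P_i = Beta(a, 3) with a ~ Uniform[3, 20], f(P) = mean of P.
data = make_dataset(
    m=5, n=200,
    f=lambda a: a / (a + 3.0),
    draw_param=lambda rng: rng.uniform(3.0, 20.0),
    draw_points=lambda rng, a, n: [rng.betavariate(a, 3.0) for _ in range(n)],
)
```

Only the second component of each pair and the sample set reach the estimator; the parameter `theta` plays the role of the unobserved $P_i$.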

We model the unobservable probability distributions $P_1,
\ldots, P_m$ as follows. Let $\mathcal{D}$ denote the set of all
distributions on $\mathcal{K}$ that have a density with respect
to the Lebesgue measure. We assume that the distributions $P_i$
are an i.i.d. sample from a measure $\mathcal{P}$ on
$\mathcal{D}$, that is,

$$ P_1, \ldots, P_m, P_{m+1} \overset{\text{i.i.d.}}{\sim} \mathcal{P}. $$

Note that $f : \mathcal{D} \to \mathbb{R}$. If $Q(\cdot \mid P)$
denotes the law of $Y$ given $P$, then the joint distribution of
$(Y, P)$ is given by

$$ \mathbb{P}(Y \in A, P \in B) = Q(Y \in A \mid P \in B)\, \mathcal{P}(P \in B). $$

Our main result is a theorem where we prove that when the
effective dimension measured by the doubling dimension is small
enough, then the estimator is consistent and the prediction risk
converges to zero with a polynomial rate.

Our results are distribution free in the sense that the only
distributional assumptions we make in this regression problem
are that $\mu_i$ has mean 0 and that $\mathbb{P}(|Y_i| \le B_Y)
= 1$ for some $B_Y$. We make no other distributional
assumptions.

Outline. In Section 2 we discuss related work. We propose a
specific estimator for distribution regression in Section 3. We
call this the kernel-kernel estimator since it makes use of
kernels in two different ways. In Section 4 we derive an upper
bound on the risk of the estimator. The proofs can be found in
Section 5. In Section 6 we analyze the risk bound in terms of
the doubling dimension, which is a measure of the intrinsic
dimension of the space. We present numerical illustrations in
Section 7. Finally, we give some concluding remarks in Section
8.

2 Related work 

Our framework is related to functional data analysis, which is a
new and steadily improving field of statistics. For
comprehensive reviews and references, see Ramsay and Silverman
(2005) and Ferraty and Vieu (2006).

A popular approach to machine learning on the domain of
distributions, such as classification and regression, is to
embed the distributions into a Hilbert space, introduce kernels
between the distributions, and then use a traditional kernel
machine to solve the learning problem. Both parametric and
nonparametric methods have been proposed in the literature.



Parametric methods (e.g. Jebara et al., 2004; Moreno et al.,
2004; Jaakkola and Haussler, 1998) usually fit a parametric
family (e.g. Gaussian distributions or an exponential family) to
the densities, and use the fitted parameters to estimate the
inner products between the distributions. The problem with
parametric approaches, however, is that when the true densities
do not belong to the assumed parametric families, the method
introduces some unavoidable bias in the estimation of the inner
products between the densities.

A couple of nonparametric approaches exist as well. Since our
covariates are represented by finite sets, reproducing kernel
Hilbert space (RKHS) based set kernels can be used in these
learning problems. Smola et al. (2007) proposed to embed the
distributions into an RKHS using the mean map kernels. In this
framework, the role of universal kernels has been studied by
Christmann and Steinwart (2010). Recently, the representer
theorem has also been generalized to the space of probability
distributions (Muandet et al., 2012).

Kondor and Jebara (2003) introduced Bhattacharyya's measure of
affinity between finite-dimensional Gaussians in a Hilbert
space. In contrast to the previous approaches, Poczos et al.
(2012) and Poczos et al. (2011) used nonparametric Renyi
divergence estimators to solve machine learning problems on the
set of distributions.

Although there are a few algorithms designed for regression on
distributions, we know very little about their theoretical
properties. To the best of our knowledge, even the simplest,
most fundamental questions have not been studied yet. For
example, we do not know how many distributions ($m$) and how
many samples ($n_i$, $i = 1, \ldots, m$) we need to achieve a
small prediction error. Our paper provides an answer to this
question.

3 The Kernel-Kernel Estimator 

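In this section we define an estimator $\hat f$ for the unknown
function $f$. Our predictor for $Y_{m+1}$ is then $\hat Y_{m+1}
= \hat f(\mathcal{X}_{m+1})$. Let $\hat P_i$ denote an estimator
of $P_i$ based on $\mathcal{X}_i$, and let $\mathcal{X}$ be a
sample from a new distribution $P = P_{m+1}$. Accordingly, we
denote with $\hat P$ an estimator of $P$ based on $\mathcal{X}$.

Given a bandwidth $h > 0$ and a kernel function $K$ (whose
properties will be specified later), we define

$$ \hat f(\hat P) = \hat f(\hat P; \hat P_1, \ldots, \hat P_m) =
\begin{cases}
\dfrac{\sum_{i=1}^m Y_i K\left(\frac{D(\hat P_i, \hat P)}{h}\right)}{\sum_{i=1}^m K\left(\frac{D(\hat P_i, \hat P)}{h}\right)} & \text{if } \sum_{i=1}^m K\left(\frac{D(\hat P_i, \hat P)}{h}\right) > 0, \\
0 & \text{otherwise.}
\end{cases} $$

To complete the definition, we need to specify $\hat P_i$, $\hat
P$ and $D$. We will estimate $P_i$ (or, more precisely, the
density $p_i$ of $P_i$) with a kernel density estimator

$$ \hat p_i(x) = \frac{1}{n_i} \sum_{j=1}^{n_i} \frac{1}{b_i^k} B\left(\frac{\|x - X_{ij}\|}{b_i}\right), \tag{3} $$

where $B$ is an appropriate kernel function (see, e.g.,
Tsybakov, 2010) with bandwidth $b_i > 0$. Here $\|x\|$ denotes
the Euclidean norm of $x \in \mathbb{R}^k$. Accordingly, $\hat
P_i$ is defined by

$$ \hat P_i(A) = \int_A \hat p_i(u)\, du, $$

for all Borel measurable subsets $A$ of $\mathbb{R}^k$. For any
two probabilities $P$ and $Q$ in $\mathcal{D}$, we take $D(P,
Q)$ to be the $L_1$ distance of their densities: $D(P, Q) = \|p
- q\| = \int |p(x) - q(x)|\, dx$. Hence, whenever the
denominator is positive,

$$ \hat f(\hat P) = \hat f(\hat P; \hat P_1, \ldots, \hat P_m) = \frac{\sum_{i=1}^m Y_i K\left(\frac{\|\hat p - \hat p_i\|}{h}\right)}{\sum_{i=1}^m K\left(\frac{\|\hat p - \hat p_i\|}{h}\right)}, \tag{4} $$

which we call the 'kernel-kernel estimator' since it makes use
of two kernels, $B$ and $K$.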
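As a rough illustration of how the kernel-kernel estimator can be computed, the sketch below implements it for one-dimensional sample sets ($k = 1$). It is not the authors' code: the function names are hypothetical, both $B$ and $K$ are taken to be triangle kernels, and the $L_1$ distance is approximated by a rectangle rule on a user-supplied, evenly spaced grid.

```python
def triangle(u):
    """Triangle kernel 1 - |u| on [-1, 1], zero elsewhere."""
    return max(0.0, 1.0 - abs(u))

def kde_on_grid(xs, grid, b):
    """1-D kernel density estimate of the sample xs at each grid point."""
    n = len(xs)
    return [sum(triangle((g - x) / b) for x in xs) / (n * b) for g in grid]

def l1_distance(p, q, dx):
    """Rectangle-rule approximation of the L1 distance of two densities."""
    return sum(abs(a - c) for a, c in zip(p, q)) * dx

def kernel_kernel_predict(train_sets, ys, new_set, grid, b, h):
    """Kernel-weighted average of the responses; 0 if all weights vanish."""
    dx = grid[1] - grid[0]                 # assumes an evenly spaced grid
    p_new = kde_on_grid(new_set, grid, b)
    num = den = 0.0
    for xs, y in zip(train_sets, ys):
        dist = l1_distance(kde_on_grid(xs, grid, b), p_new, dx)
        w = triangle(dist / h)             # outer kernel K applied to D / h
        num += w * y
        den += w
    return num / den if den > 0 else 0.0

grid = [i / 100 for i in range(101)]
train_sets = [[0.1, 0.2, 0.3], [0.7, 0.8, 0.9]]
ys = [1.0, 5.0]
pred = kernel_kernel_predict(train_sets, ys, [0.1, 0.2, 0.3], grid, b=0.1, h=0.5)
```

In this toy call the new sample set coincides with the first training set, so only that set receives a non-zero weight and the prediction equals its response. Recomputing the training density estimates for every query is wasteful; caching them would be the natural first optimisation.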

For simplicity, $n$ will denote the size of the sample
$\mathcal{X}$, and $b$ will be the bandwidth in the estimator of
$p$.

In what follows we will make the following assumptions on $f$,
$K$, $\mathcal{P}$, $\mu$, and $Y_i$.

Assumptions 

• (A1) Hölder continuous functional. The unknown functional $f$
belongs to the class $\mathcal{M} = \mathcal{M}(L, \beta, D)$ of
Hölder continuous functionals on $\mathcal{D}$:

$$ \mathcal{M} = \left\{ f : |f(P_i) - f(P_j)| \le L\, D(P_i, P_j)^{\beta} \right\}, $$

for some $L > 0$ and $0 < \beta \le 1$, where $D$ is the above
specified $L_1$ metric on $\mathcal{D}$. In the $\beta = 1$
special case this means that $f$ is Lipschitz continuous.

• (A2) Asymmetric boxed and Lipschitz kernel. The kernel $K$
satisfies the following properties: $K : [0, \infty) \to
\mathbb{R}$ is non-negative and Lipschitz continuous with
Lipschitz constant $L_K$. In addition, there exist constants $0
< \underline{K} < 1$ and $0 < r < R < \infty$ such that, for all
$x \ge 0$, it holds that

$$ \underline{K}\, I_{\{x \in B(0, r)\}} \le K(x) \le I_{\{x \in B(0, R)\}}. $$

• (A3) Hölder class of distributions. The distribution
$\mathcal{P}$ is supported on the set of distributions
$\mathcal{H}_k(1)$ with densities that are 1-smooth Hölder
functions, as defined in Rigollet and Vert (2009).

• (A4) Bounded regression. We will assume that $\sup_{P \in
\mathcal{D}} |f(P)| < f_{\max}$ for some $f_{\max} > 0$. Also,
$\mu_i$ has mean 0 and $\mathbb{P}(|Y_i| \le B_Y) = 1$ for some
$B_Y < \infty$.

• (A5) Lower bound on $\min_{1 \le i \le m+1} n_i$. Let $n =
\min_{1 \le i \le m+1} n_i$. We assume that $e^{\frac{1}{2}
n^{k/(2+k)}} / m \to \infty$ as $m \to \infty$.

• (A6) Relationship between $n$ and $h$. Assume that $C^*
n^{-\frac{1}{2+k}} \le rh/4$, where $C^*$ is defined in (9).



4 Upper Bound on Risk 

We are concerned with upper bounding the risk

$$ R(m, n) = \mathbb{E}\left[\, |\hat f(\hat P; \hat P_1, \ldots, \hat P_m) - f(P)| \,\right], $$

where the expectation is with respect to the joint distribution
of the sample $(\mathcal{X}_1, Y_1), \ldots, (\mathcal{X}_m,
Y_m)$, the new covariate $P = P_{m+1}$ and the new observation
$\mathcal{X}_{m+1}$. Note that the absolute prediction risk is
$\mathbb{E}|\hat Y - Y| \le R(m, n) + c$, where $c =
\mathbb{E}(|\mu|)$ is a constant. So bounding the prediction
risk is equivalent to bounding $R(m, n)$, which we call the
excess prediction risk. In what follows, $C, c_1, c_2, \ldots$
represent constants whose value can be different in different
expressions.

Let $B(P, h) = \{\tilde P \in \mathcal{D} : D(P, \tilde P) \le
h\}$ denote the $L_1$ ball of distributions around $P$ with
radius $h$. We will see that the risk depends on the size of the
class of probabilities $\mathcal{D}$. In particular, the risk
depends on the small ball probability

$$ \Phi_P(h) = \mathcal{P}(B(P, h)), $$

where $P$ is a fixed distribution and $\Phi_P(h)$ is a function
of $P$.

Our first result, Theorem 1, provides a general upper bound on
the risk. In our second result (Section 6) we show that when the
effective dimension measured by the doubling dimension is small,
then the risk converges to zero. We also derive an upper bound
on the rate of convergence.

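Theorem 1 Suppose that the assumptions stated above hold. Let $b
= n^{-\frac{1}{2+k}}$ be the bandwidth in the density estimators
$\hat p_i$. Then

$$ R(m, n) \le \frac{C_1}{h}\, \mathbb{E}\left[\frac{1}{\Phi_P(rh/2)}\right] n^{-\frac{1}{2+k}} + C_2 h^{\beta} + C_3 \sqrt{\mathbb{E}\left[\frac{1}{m\, \Phi_P(rh/2)}\right]} + \frac{C_4}{m}\, \mathbb{E}\left[\frac{1}{\Phi_P(rh/2)}\right] + (m + 1)\, e^{-\frac{1}{2} n^{\frac{k}{2+k}}}, $$

where the constants $C_i$ are specified in the proof.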
5 Proof of Theorem 1

In this section we prove our main result, Theorem 1. The main
idea of the proof is to use the triangle inequality to write

$$ R(m, n) = \mathbb{E}|\hat f(\hat P; \hat P_1, \ldots, \hat P_m) - f(P)| $$
$$ \le \mathbb{E}|\hat f(\hat P; \hat P_1, \ldots, \hat P_m) - \hat f(P; P_1, \ldots, P_m)| \tag{5} $$
$$ + \; \mathbb{E}|\hat f(P; P_1, \ldots, P_m) - f(P)|. \tag{6} $$

In Sections 5.2 and 5.3 we will derive upper bounds for (5) and
(6), respectively. Section 5.1 contains a series of technical
results needed in our proofs.

Throughout, we let $K_i = K\left(\frac{D(P_i, P)}{h}\right)$,
$\hat K_i = K\left(\frac{D(\hat P_i, \hat P)}{h}\right)$, and
$\epsilon_i = \hat K_i - K_i$, for $i = 1, \ldots, m$. Note
that, for ease of readability, we have omitted the dependence on
$h$.

5.1 Technical Results 

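5.1.1 $L_1$ Risk of Density Estimators

In this section we bound $\mathbb{E}[D(\hat P, P) \mid P] =
\mathbb{E}[\int |\hat p - p| \mid P]$, the $L_1$ risk of the
density estimator $\hat p$ of $p$, uniformly over all $P$ in
$\mathcal{D}$. To this end, suppose that $n_i \ge n$ for all $i
= 1, 2, \ldots, m + 1$, and let $b_i = b = n^{-\frac{1}{2+k}}$.
In this case, the following lemma provides an upper bound on the
$L_1$ risk of the density estimator.

Lemma 2

$$ \mathbb{E}[D(\hat P_i, P_i) \mid P_i] \le C n^{-\frac{1}{2+k}}, \qquad \mathbb{E}[D(\hat P_i, P_i)] \le C n^{-\frac{1}{2+k}}, \tag{7} $$

where

$$ C = c_0 (c_1 + c_2), \tag{8} $$

with $c_0$, $c_1$ and $c_2$ constants specified in the proof.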

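Proof. Recall that we assume that $\mathcal{P}$ is supported on
the set $\mathcal{H}_k(1)$ of distributions, which are 1-smooth
$k$-dimensional densities as defined in Rigollet and Vert
(2009).

Let $\mathbb{E}[D_2^2(\hat P_i, P_i) \mid P_i] =
\mathbb{E}[\int (\hat p_i - p_i)^2 \mid P_i]$ denote the
integrated mean squared risk for the density estimator $\hat
p_i$ of a fixed density $p_i$. It then follows from Lemma 4.1 of
Rigollet and Vert (2009) that (with an appropriate kernel
function $B$),

$$ \mathbb{E}[D_2^2(\hat P_i, P_i) \mid P_i] \le c_1^2 b_i^2 + \frac{c_2^2}{n_i b_i^k}, $$

for some constants $c_1, c_2 > 0$.

From Jensen's inequality, we have that $\mathbb{E}[X] \le
(\mathbb{E}[X^2])^{1/2}$ for any random variable $X$. We also
know that $(a + b)^{1/2} \le a^{1/2} + b^{1/2}$ for $a, b \ge
0$, therefore

$$ \mathbb{E}[D_2(\hat P_i, P_i) \mid P_i] \le \left[ c_1^2 b_i^2 + \frac{c_2^2}{n_i b_i^k} \right]^{1/2} \le c_1 b_i + \frac{c_2}{n_i^{1/2} b_i^{k/2}}. $$

Since the distributions in $\mathcal{D}$ are supported on a
compact set and the kernel $B$ also has compact support, we
have, for an appropriate constant $c_0 > 0$,

$$ \int |\hat p_i - p_i| \le c_0 \left( \int (\hat p_i - p_i)^2 \right)^{1/2}. $$

Therefore,

$$ \mathbb{E}[D(\hat P_i, P_i) \mid P_i] \le c_0\, \mathbb{E}[D_2(\hat P_i, P_i) \mid P_i] \le c_0 \left( c_1 b_i + \frac{c_2}{n_i^{1/2} b_i^{k/2}} \right) \le c_0 (c_1 + c_2)\, n^{-\frac{1}{2+k}}, $$

where the last step follows from our assumptions that $n_i \ge
n$ and $b_i = b = n^{-\frac{1}{2+k}}$: we have $n_i^{-1/2}
b^{-k/2} \le n^{-\frac{1}{2}} n^{\frac{k}{2(2+k)}} =
n^{-\frac{1}{2+k}}$, and thus

$$ c_1 b_i + \frac{c_2}{n_i^{1/2} b_i^{k/2}} \le (c_1 + c_2)\, n^{-\frac{1}{2+k}}. $$

□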



Next, we show that the terms $D(\hat P_i, P_i)$ are uniformly
bounded by a term of order $O(h)$, with high probability.

Lemma 3 With probability no smaller than $1 - (m + 1)
e^{-\frac{1}{2} n^{\frac{k}{2+k}}}$, $D(\hat P_i, P_i) \le
\frac{rh}{4}$ for all $i = 1, \ldots, m + 1$.

Notice that by Assumption (A5), $1 - (m + 1) e^{-\frac{1}{2}
n^{\frac{k}{2+k}}} \to 1$.
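To get a feel for how large $n$ must be relative to $m$ before the probability bound of Lemma 3 becomes informative, one can simply evaluate the failure term $(m + 1) e^{-\frac{1}{2} n^{k/(2+k)}}$. The helper below is a hypothetical convenience function written for this illustration, not part of the paper.

```python
import math

def lemma3_failure_bound(m, n, k):
    """Evaluate (m + 1) * exp(-n^(k / (2 + k)) / 2).

    m: number of training distributions
    n: smallest sample-set size
    k: dimension of the space on which each distribution lives
    """
    return (m + 1) * math.exp(-0.5 * n ** (k / (2.0 + k)))

# With few points per set the bound exceeds 1 and says nothing;
# with many points it is negligible.
small_n = lemma3_failure_bound(m=100, n=100, k=1)
large_n = lemma3_failure_bound(m=100, n=10**6, k=2)
```

A value above 1 means the bound is vacuous for that combination of $m$, $n$ and $k$; the exponent $k/(2+k)$ is small for low-dimensional data, so the bound then decays slowly in $n$.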

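Proof. From McDiarmid's inequality, for any $\epsilon > 0$ we
have that

$$ \mathbb{P}\left( \|\hat p_i - p_i\|_1 - \mathbb{E}\|\hat p_i - p_i\|_1 > \epsilon \right) \le e^{-n \epsilon^2 / 2} $$

(see, for example, Section 2.4 of Devroye and Lugosi, 2001).
Thus,

$$ \mathbb{P}\left( \|\hat p_i - p_i\|_1 > \mathbb{E}\|\hat p_i - p_i\|_1 + n^{-\frac{1}{2+k}} \right) \le e^{-\frac{1}{2} n^{\frac{k}{2+k}}}, $$

since $n \cdot n^{-\frac{2}{2+k}} = n^{\frac{k}{2+k}}$. This
implies that

$$ \mathbb{P}\left( \max_{1 \le i \le m+1} \|\hat p_i - p_i\|_1 > \mathbb{E}\|\hat p_i - p_i\|_1 + n^{-\frac{1}{2+k}} \right) \le (m + 1)\, e^{-\frac{1}{2} n^{\frac{k}{2+k}}} \to 0, $$

by assumption (A5). Therefore,

$$ 1 - (m + 1)\, e^{-\frac{1}{2} n^{\frac{k}{2+k}}} \le \mathbb{P}\left( \max_{1 \le i \le m+1} \|\hat p_i - p_i\|_1 \le \mathbb{E}\|\hat p_i - p_i\|_1 + n^{-\frac{1}{2+k}} \right) $$
$$ \le \mathbb{P}\left( \max_{1 \le i \le m+1} \|\hat p_i - p_i\|_1 \le (1 + c_0 (c_1 + c_2))\, n^{-\frac{1}{2+k}} \right). $$

This implies that with

$$ C^* = 1 + c_0 (c_1 + c_2) \tag{9} $$

and using assumption (A6), we have that

$$ D(\hat P_i, P_i) \le C^* n^{-\frac{1}{2+k}} \le \frac{rh}{4} \quad \text{for all } i \tag{10} $$

on an event $\Omega_{m,n}$, where $\mathbb{P}(\Omega_{m,n}^c)
\le (m + 1) e^{-\frac{1}{2} n^{\frac{k}{2+k}}}$. Here
$\Omega_{m,n}^c$ denotes the complement of $\Omega_{m,n}$. □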

5.1.2 Other Lemmata 

Throughout this section we will make use of the constant $C$
defined in (8). In what follows, we will need a few lemmas that
we list below. Their proofs can be found in the supplementary
material.

The following lemma provides an upper bound on
$\mathbb{P}(\sum_{i=1}^m K_i = 0)$ with the help of small ball
probabilities.



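Lemma 4

$$ \mathbb{P}\left( \sum_{i=1}^m K_i = 0 \right) \le \mathbb{P}\left( \sum_{i=1}^m K_i < \underline{K} \right) \le \frac{1}{em}\, \mathbb{E}\left[ \frac{1}{\Phi_P(rh)} \right]. $$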

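We will also need the following lemma.

Lemma 5

$$ \mathbb{E}\left[ \frac{I_{\{\sum_i K_i \ge \underline{K}\}}}{\sum_i K_i} \right] \le \frac{1 + 1/\underline{K}}{m \underline{K}}\, \mathbb{E}\left[ \frac{1}{\Phi_P(Rh)} \right]. $$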

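The following lemma provides an upper bound on $|\epsilon_i|$.

Lemma 6 Assume that the kernel function $K$ is Lipschitz
continuous with Lipschitz constant $L_K$. We have that

$$ |\epsilon_i| \le \frac{L_K}{h} \left( D(\hat P, P) + D(\hat P_i, P_i) \right). $$

By definition, $|\epsilon_i| = |\hat K_i - K_i| =
\left|K\left(\frac{D(\hat P, \hat P_i)}{h}\right) -
K\left(\frac{D(P, P_i)}{h}\right)\right|$, which is a
deterministic function of the random variables $\hat P$, $\hat
P_i$, $P$, and $P_i$. We will denote this deterministic
relationship as $\epsilon_i = \epsilon_i(\hat P, P, \hat P_i,
P_i)$. The following lemma shows that for any $\kappa > 0$,

$$ \mathbb{P}\left( \sum_{i=1}^m |\epsilon_i(\hat P, P, \hat P_i, P_i)| \le \kappa \,\Big|\, \{P_i\}_{i=1}^m, P \right) $$

can be lower bounded by a non-trivial quantity that does not
depend on $P$ and $\{P_i\}_{i=1}^m$.

Lemma 7 For any $\kappa > 0$ we have that

$$ \mathbb{P}\left( \sum_{i=1}^m |\epsilon_i(\hat P, P, \hat P_i, P_i)| \le \kappa \,\Big|\, \{P_i\}_{i=1}^m, P \right) \ge \eta, $$

where $\eta = \eta(\kappa, n, m) = 1 - \frac{2 L_K m C}{\kappa
h}\, n^{-\frac{1}{2+k}}$.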

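The following lemma provides an upper bound on the expected
value of $\sum_{i=1}^m |\epsilon_i|$.

Lemma 8

$$ \mathbb{E}\left[ \sum_{i=1}^m |\epsilon_i| \,\Big|\, P, \{P_i\}_{i=1}^m \right] \le \frac{2 L_K C m}{h}\, n^{-\frac{1}{2+k}}. $$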


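The next lemma shows that $\mathbb{P}(\sum_{i=1}^m \hat K_i <
\underline{K})$ can be upper bounded by a small quantity as
well. We assume that $n_i = n$ and $b_i = b$ for all $i$. Define

$$ \zeta = \zeta(n, m) = \frac{1}{em}\, \mathbb{E}\left[ \frac{1}{\Phi_P(rh/2)} \right] + (m + 1)\, e^{-\frac{1}{2} n^{\frac{k}{2+k}}}. $$

Lemma 9

$$ \mathbb{P}\left( \sum_{i=1}^m \hat K_i = 0 \right) \le \mathbb{P}\left( \sum_{i=1}^m \hat K_i < \underline{K} \right) \le \zeta. $$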

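5.2 Upper bound on Equation (5)

Let $\Delta f = |\hat f(\hat P; \hat P_1, \ldots, \hat P_m) -
\hat f(P; P_1, \ldots, P_m)|$. Our goal is to provide an upper
bound on $\mathbb{E}[\Delta f]$.

Introduce the following events: $E_0 = \{\sum_i K_i = 0\}$, $E_1
= \{0 < \sum_i K_i < \underline{K}\}$, $E_2 = \{\underline{K}
\le \sum_i K_i\}$. Similarly, $\hat E_0 = \{\sum_i \hat K_i =
0\}$, $\hat E_1 = \{0 < \sum_i \hat K_i < \underline{K}\}$,
$\hat E_2 = \{\underline{K} \le \sum_i \hat K_i\}$. Obviously

$$ \mathbb{E}[\Delta f] = \mathbb{E}\left[ \Delta f \sum_{j=0}^{2} \sum_{l=0}^{2} I_{E_j} I_{\hat E_l} \right]. $$

Based on the sign of $\sum_i \hat K_i$ and $\sum_i K_i$, there
are four different cases. (i) If $\sum_i K_i > 0$ and $\sum_i
\hat K_i > 0$, then $\Delta f = \left| \frac{\sum_i Y_i \hat
K_i}{\sum_i \hat K_i} - \frac{\sum_i Y_i K_i}{\sum_i K_i}
\right|$. (ii) If $\sum_i \hat K_i > 0$ and $\sum_i K_i = 0$,
then $\Delta f = \left| \frac{\sum_i Y_i \hat K_i}{\sum_i \hat
K_i} \right|$. (iii) If $\sum_i \hat K_i = 0$ and $\sum_i K_i >
0$, then $\Delta f = \left| \frac{\sum_i Y_i K_i}{\sum_i K_i}
\right|$, and finally (iv) if $\sum_i \hat K_i = 0$ and $\sum_i
K_i = 0$, then $\Delta f = 0$. From this it immediately follows
that $\mathbb{E}[\Delta f\, I_{E_0} I_{\hat E_0}] = 0$.

When $\sum_i K_i > 0$, $\left| \frac{\sum_i Y_i K_i}{\sum_i K_i}
\right| \le B_Y$. Therefore,

$$ \mathbb{E}\left[ \Delta f\, I_{\hat E_0} (I_{E_1} + I_{E_2}) \right] \le B_Y\, \mathbb{E}\left[ I_{\{\sum_i K_i > 0 \,\wedge\, \sum_i \hat K_i = 0\}} \right] = B_Y\, \mathbb{P}\Big( \sum_i K_i > 0, \sum_i \hat K_i = 0 \Big) \le B_Y\, \mathbb{P}\Big( \sum_{i=1}^m \hat K_i = 0 \Big) \le B_Y\, \zeta(n, m). $$

Similarly,

$$ \mathbb{E}\left[ \Delta f\, I_{E_0} (I_{\hat E_1} + I_{\hat E_2}) \right] \le \frac{B_Y}{em} \int \frac{d\mathcal{P}(P)}{\Phi_P(rh)}. $$

It is also easy to see that

$$ \mathbb{E}\left[ \Delta f\, I_{E_1} (I_{\hat E_1} + I_{\hat E_2}) \right] \le 2 B_Y\, \mathbb{E}[I_{E_1}] = 2 B_Y\, \mathbb{P}\Big( 0 < \sum_i K_i < \underline{K} \Big) \le \frac{2 B_Y}{em} \int \frac{d\mathcal{P}(P)}{\Phi_P(rh)}. $$

Similarly,

$$ \mathbb{E}\left[ \Delta f\, I_{\hat E_1} (I_{E_1} + I_{E_2}) \right] \le 2 B_Y\, \mathbb{P}\Big( 0 < \sum_i \hat K_i < \underline{K} \Big) \le 2 B_Y\, \zeta(n, m). $$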

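All that is left is to upper bound $\mathbb{E}[\Delta f\,
I_{E_2} I_{\hat E_2}]$. The next lemma provides an upper bound
for this.

Lemma 10

$$ \mathbb{E}\left[ \Delta f\, I_{E_2} I_{\hat E_2} \right] \le \frac{c_4}{h}\, \mathbb{E}\left[ \frac{1}{\Phi_P(Rh)} \right] n^{-\frac{1}{2+k}}. $$

The proof can be found in the supplementary material.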

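Finally, putting the pieces together we obtain the following
theorem.

Theorem 11

$$ \mathbb{E}|\hat f(\hat P; \hat P_1, \ldots, \hat P_m) - \hat f(P; P_1, \ldots, P_m)| \le \frac{C_1}{h}\, \mathbb{E}\left[ \frac{1}{\Phi_P(rh/2)} \right] n^{-\frac{1}{2+k}} + \frac{C_2}{m}\, \mathbb{E}\left[ \frac{1}{\Phi_P(rh/2)} \right] + (m + 1)\, e^{-\frac{1}{2} n^{\frac{k}{2+k}}}. $$

The proof can be found in the supplementary material.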

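5.3 Upper bound on Equation (6)

In this section we show that under the above specified
conditions $\mathbb{E}|\hat f(P; P_1, \ldots, P_m) - f(P)|$ can
be upper bounded by

$$ C_1 h^{\beta} + C_2 \sqrt{\mathbb{E}\left[ \frac{1}{m\, \Phi_P(rh/2)} \right]} + \frac{C_3}{m}\, \mathbb{E}\left[ \frac{1}{\Phi_P(rh/2)} \right]. $$

We have to bound $\mathbb{E}|\hat f(P; P_1, \ldots, P_m) -
f(P)|$. Note that $Y_i = f(P_i) + \mu_i$ and

$$ \mathbb{E}|\hat f(P; P_1, \ldots, P_m) - f(P)| \le \mathbb{E}\left[ \frac{\sum_i |f(P_i) - f(P)| K_i}{\sum_i K_i}\, I_{\{\sum_i K_i > 0\}} \right] + \mathbb{E}\left[ \left| \frac{\sum_i \mu_i K_i}{\sum_i K_i} \right| I_{\{\sum_i K_i > 0\}} \right] + \mathbb{E}\left[ |f(P)|\, I_{\{\sum_i K_i = 0\}} \right]. $$

We will bound each of the three terms next. For the first term,
since $f$ is Hölder-$\beta$ we have

$$ \mathbb{E}\left[ \frac{\sum_i |f(P_i) - f(P)| K_i}{\sum_i K_i}\, I_{\{\sum_i K_i > 0\}} \right] \le \mathbb{E}\left[ \frac{\sum_i L\, D(P_i, P)^{\beta} K_i}{\sum_i K_i}\, I_{\{\sum_i K_i > 0\}} \right] \le L (hR)^{\beta}, $$

where in the last step we used the fact that

$$ D(P_i, P)^{\beta} K_i = D(P_i, P)^{\beta} K\left( \frac{D(P_i, P)}{h} \right) \le (hR)^{\beta} K_i, $$

since $\mathrm{supp}(K) \subset B(0, R)$.

We now bound the second term.

$$ \mathbb{E}\left[ \left| \frac{\sum_i \mu_i K_i}{\sum_i K_i} \right| I_{\{\sum_i K_i > 0\}} \right] \le \mathbb{E}\left[ \left| \frac{\sum_i \mu_i K_i}{\sum_i K_i} \right| I_{\{\sum_i K_i \ge \underline{K}\}} \right] + B_Y\, \mathbb{P}\Big( \underline{K} > \sum_i K_i \Big) $$
$$ \le \mathbb{E}\left[ \left| \frac{\sum_i \mu_i K_i}{\sum_i K_i} \right| I_{\{\sum_i K_i \ge \underline{K}\}} \right] + \frac{B_Y}{em} \int \frac{d\mathcal{P}(P)}{\Phi_P(rh)}. $$

(A4) implies that $\mathbb{P}(|\mu_i| \le B_Y) = 1$, i.e. $B_Y$
is a bound on the noise. The last step follows from Lemma 4. For
the first term in the above expression, we use the following
lemma. Its proof can be found in the supplementary material.

Lemma 12

$$ \mathbb{E}\left[ \left| \frac{\sum_i \mu_i K_i}{\sum_i K_i} \right| I_{\{\sum_i K_i \ge \underline{K}\}} \right] \le B_Y \sqrt{ \frac{1 + 1/\underline{K}}{m \underline{K}} \int \frac{d\mathcal{P}(P)}{\Phi_P(Rh)} }, $$

where the integral is with respect to the random probability
measure $P$ drawn from $\mathcal{P}$.

Finally, we bound the third term using Lemma 4:

$$ \mathbb{E}\left[ |f(P)|\, I_{\{\sum_i K_i = 0\}} \right] \le f_{\max}\, \mathbb{P}\Big( \sum_i K_i = 0 \Big) \le \frac{f_{\max}}{em} \int \frac{d\mathcal{P}(P)}{\Phi_P(rh)}. $$

Putting everything together, we have

$$ \mathbb{E}|\hat f(P; P_1, \ldots, P_m) - f(P)| \le L (hR)^{\beta} + B_Y \sqrt{ \frac{1 + 1/\underline{K}}{m \underline{K}} \int \frac{d\mathcal{P}(P)}{\Phi_P(Rh)} } + \frac{B_Y}{em} \int \frac{d\mathcal{P}(P)}{\Phi_P(rh)} + \frac{f_{\max}}{em} \int \frac{d\mathcal{P}(P)}{\Phi_P(rh)} $$
$$ \le C_1 h^{\beta} + C_2 \sqrt{\mathbb{E}\left[ \frac{1}{m\, \Phi_P(rh/2)} \right]} + \frac{C_3}{m}\, \mathbb{E}\left[ \frac{1}{\Phi_P(rh/2)} \right]. $$

Note that $\Phi_P(rh/2) \le \Phi_P(rh) \le \Phi_P(Rh)$.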
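6 Doubling Dimension

The upper bound on the risk in Theorem 1 depends on the quantity
$\mathbb{E}\left[ \frac{1}{\Phi_P(rh/2)} \right]$. In future
work, we will show that, without further assumptions, this
quantity can be quite large, which leads to very slow rates of
convergence. This is because the covering number of the class
$\mathcal{H}_k(1)$ is huge. For this paper, we concentrate on
the more optimistic case where the support of $\mathcal{P}$ has
small effective dimension.

One way to measure effective dimension is to use the doubling
dimension. Following Kpotufe (2011), we say that $\mathcal{P}$
is a doubling measure with effective dimension $d$ if, for every
$r > 0$ and $0 < \epsilon < 1$,

$$ \frac{\mathcal{P}(B(s, r))}{\mathcal{P}(B(s, \epsilon r))} \le \left( \frac{c}{\epsilon} \right)^{d}. \tag{11} $$

If $d$ denotes the doubling dimension of the measure
$\mathcal{P}$, then the $\sqrt{\mathbb{E}[1/(m\,
\Phi_P(rh/2))]}$ term in Theorem 1 can be upper bounded as
follows:

$$ \sqrt{\mathbb{E}\left[ \frac{1}{m\, \Phi_P(rh/2)} \right]} = \sqrt{\frac{1}{m}\, \mathbb{E}\left[ \frac{\Phi_P(1)}{\Phi_P(rh/2)} \cdot \frac{1}{\Phi_P(1)} \right]} \le \sqrt{\frac{1}{m}\, C (rh/2)^{-d}\, \mathbb{E}\left[ \frac{1}{\Phi_P(1)} \right]} \le C \sqrt{\frac{1}{m h^d}}. $$

Note also that when $\frac{1}{m h^d} \le 1$, then $\frac{1}{m
h^d} \le \sqrt{\frac{1}{m h^d}}$. In this case, as a corollary
of Theorem 1 we now have that

$$ R(m, n) \le C_1 \sqrt{\frac{1}{m h^d}} + C_2 h^{\beta} + \frac{C_3}{h^{d+1} n^{1/(k+2)}} + (m + 1)\, e^{-\frac{1}{2} n^{\frac{k}{2+k}}}, \tag{12} $$

for appropriate constants $C_1$, $C_2$ and $C_3$.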

To derive the rates for the risk, we consider two separate
cases, depending on whether the third term on the right hand
side of (12) dominates the first term or not.

Thus first assume that

$$ \sqrt{\frac{1}{m h^d}} = \Omega\left( \frac{1}{h^{d+1} n^{1/(k+2)}} \right), \tag{13} $$

so that the risk becomes, asymptotically, $O\left( h^{\beta} +
\sqrt{\frac{1}{m h^d}} \right)$. The optimal choice for $h$ is
then $\Theta\left( m^{-1/(2\beta + d)} \right)$, yielding a rate
for the risk

$$ R(m, n) = O\left( m^{-\beta/(2\beta + d)} \right). $$

Notice that this choice of $h$ ensures that our assumption (A6)
is met, since in this case (13) implies that $n = \Omega\left(
m^{\frac{(k+2)(\beta+d+1)}{2\beta+d}} \right)$, from which we
obtain that

$$ h = \Theta\left( m^{-\frac{1}{2\beta+d}} \right) = \Omega\left( n^{-\frac{1}{(k+2)(\beta+d+1)}} \right) = \Omega\left( n^{-\frac{1}{k+2}} \right). $$

This rate is reasonable because if the number of samples per
distribution $n$ is large compared to the number $m$ of
distributions, then the learning rate is limited by the number
of distributions $m$ and is in fact precisely the same as the
rate of learning a standard $\beta$-Hölder smooth regression
function in $d$ dimensions. That is, the effect of not knowing
the distributions $P_1, \ldots, P_m$ exactly and only having a
finite sample from the distributions is negligible.
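For readers who want the balancing step spelled out, the following short derivation shows where the bandwidth and rate in this first case come from. It treats constants as irrelevant and only uses a risk of order $h^{\beta} + \sqrt{1/(m h^d)}$; the numerical instance at the end ($\beta = 1$, $d = 2$) is an arbitrary illustration, not a setting taken from the paper.

```latex
% Balance the approximation term against the stochastic term:
\[
  h^{\beta} \asymp \sqrt{\frac{1}{m h^{d}}}
  \;\Longleftrightarrow\;
  h^{2\beta + d} \asymp \frac{1}{m}
  \;\Longleftrightarrow\;
  h \asymp m^{-\frac{1}{2\beta + d}} .
\]
% Plugging this bandwidth back into either term gives the rate:
\[
  h^{\beta} \asymp m^{-\frac{\beta}{2\beta + d}} ,
  \qquad\text{e.g. } \beta = 1,\ d = 2
  \;\Rightarrow\; h \asymp m^{-1/4},\quad R(m,n) = O\!\left(m^{-1/4}\right).
\]
```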



For the second case, suppose that

$$ \sqrt{\frac{1}{m h^d}} = O\left( \frac{1}{h^{d+1} n^{1/(k+2)}} \right). \tag{14} $$

Then, $R(m, n) = O\left( \frac{1}{h^{d+1} n^{1/(k+2)}} +
h^{\beta} \right)$, which implies that the optimal choice for
$h$ is $h = \Theta\left( n^{-\frac{1}{(k+2)(\beta+d+1)}}
\right)$, giving the rate

$$ R(m, n) = O\left( n^{-\frac{\beta}{(k+2)(\beta+d+1)}} \right). $$

Just like before, this choice of $h$ does not violate assumption
(A6) since

$$ h = \Theta\left( n^{-\frac{1}{(k+2)(\beta+d+1)}} \right) = \Omega\left( n^{-\frac{1}{k+2}} \right). $$

Notice that (14) also implies that

$$ m = \Omega\left( n^{\frac{2\beta + d}{(k+2)(\beta+d+1)}} \right). $$

In this case, the rate is limited by the number of samples per
distribution $n$, as expected. Notice that the rate gets worse
as the dimensionality $k$ of each distribution grows and as the
smoothness $\beta$ of the regression function deteriorates.
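The two regimes can be compared numerically with a small helper. This is only a sketch: the function name is hypothetical, all constants are ignored, and the regime is decided by taking whichever of the two candidate rates is slower, which is a simplification of the order conditions (13) and (14).

```python
def distribution_regression_rate(m, n, beta, d, k):
    """Return (h, rate, regime) up to constants.

    m: number of training distributions, n: points per sample set,
    beta: Hoelder exponent of f, d: doubling dimension,
    k: dimension of the space on which each distribution lives.
    """
    rate_m = m ** (-beta / (2.0 * beta + d))
    rate_n = n ** (-beta / ((k + 2.0) * (beta + d + 1.0)))
    if rate_m >= rate_n:
        # Limited by the number of distributions m.
        h = m ** (-1.0 / (2.0 * beta + d))
        return h, rate_m, "distributions"
    # Limited by the number of points per distribution n.
    h = n ** (-1.0 / ((k + 2.0) * (beta + d + 1.0)))
    return h, rate_n, "samples"

many_points = distribution_regression_rate(m=100, n=10**12, beta=1, d=1, k=1)
few_points = distribution_regression_rate(m=100, n=10, beta=1, d=1, k=1)
```

With very large sample sets the bound behaves like ordinary $d$-dimensional nonparametric regression in $m$; with small sample sets the density estimation error dominates and the exponent also depends on $k$.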
Remark. If there is no additive noise, i.e. $\mu_i = 0$, similar
calculations yield that $R(m, n) = O\left(
m^{-\frac{\beta}{\beta+d}} \right)$ when $n = \Omega\left(
m^{\frac{(\beta+d+1)(k+2)}{\beta+d}} \right)$, and $R(m, n) =
O\left( n^{-\frac{\beta}{(k+2)(\beta+d+1)}} \right)$ otherwise.
While the rates seem reasonable, establishing optimality of the
rates by demonstrating matching lower bounds is an open question
that we plan to investigate in future work.



7 Numerical Illustrations 

The following experiments serve as a proof of concept to
demonstrate the applicability of the distribution regression
estimator of Section 3. In these experiments, we used triangle
kernels ($K(x) = 1 - |x|$ if $-1 \le x \le 1$, and 0 otherwise).
We set all the set sizes $n, n_1, \ldots, n_m$ and all the
bandwidths $b, b_1, \ldots, b_m$ to the same values, which will
be specified below. In the first experiment, we generated 325
sample sets from $\mathrm{Beta}(a, 3)$ distributions where $a$
was varied randomly in $[3, 20]$. We constructed $m = 250$
sample sets for training, 25 for validation, and 50 for testing.
Each sample set contained $n = 500$ i.i.d. points distributed as
$\mathrm{Beta}(a, 3)$. Our task in this experiment was to learn
the skewness of $\mathrm{Beta}(a, b)$ distributions, $f =
\frac{2 (b - a) \sqrt{a + b + 1}}{(a + b + 2) \sqrt{ab}}$. We
considered the noiseless case, i.e. $\mu$ was set to zero. Our
estimator of course is not aware that the sample sets are coming
from beta distributions, and it does not know the skewness
function values in the test sets either; its values are
available only in the training and validation sets.
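A data set of the kind described in this first experiment can be generated as follows. This is a minimal sketch with hypothetical function names; the random seed and the order in which sets are assigned to the training, validation and test splits are arbitrary choices made here, not details reported in the paper.

```python
import math
import random

def beta_skewness(a, b):
    """Skewness of the Beta(a, b) distribution."""
    return 2.0 * (b - a) * math.sqrt(a + b + 1.0) / ((a + b + 2.0) * math.sqrt(a * b))

def make_beta_sets(num_sets=325, n=500, b=3.0, seed=0):
    """Sample sets from Beta(a, b) with a ~ Uniform[3, 20], paired with skewness."""
    rng = random.Random(seed)
    sets = []
    for _ in range(num_sets):
        a = rng.uniform(3.0, 20.0)
        points = [rng.betavariate(a, b) for _ in range(n)]
        sets.append((points, beta_skewness(a, b)))   # noiseless response
    # 250 training, 25 validation, 50 test sample sets.
    return sets[:250], sets[250:275], sets[275:]

train, valid, test = make_beta_sets()
```

Because $a \ge 3 = b$ in this setup, every target value is non-positive, which matches the left-skewed shape of these beta densities.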

To find appropriate bandwidths $h$ and $b$, we sampled 100
i.i.d. values uniformly at random in $[0, 1]$, evaluated the MSE
performance of the distribution regression estimator on the
validation set using these bandwidth parameters, and then chose
the bandwidth parameters that led to the best values on the
validation set. To estimate the $L_2$ distances between $\hat
p_i$ and $\hat p$, we calculated their estimated values at 4096
points on a uniform grid between the min and max values in the
sample sets, and then estimated the integral $\int (\hat p(x) -
\hat p_i(x))^2\, dx$ with rectangle-method numerical
integration. Figure 2(a) displays the predicted values for the
50 test sample sets, and we also show the true values of the
skewness function. As we can see, the true and the estimated
values are very close to each other.
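The bandwidth search described above amounts to a random search scored on the validation sets. The sketch below assumes a user-supplied callable `predict(sample_set, b, h)` (a hypothetical interface, not something defined in the paper) and draws both bandwidths from $(0, 1]$ so that a bandwidth of exactly zero can never occur.

```python
import random

def select_bandwidths(predict, val_sets, val_targets, trials=100, seed=0):
    """Random search over (b, h); returns (best_mse, best_b, best_h)."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        b = 1.0 - rng.random()   # uniform on (0, 1]
        h = 1.0 - rng.random()
        errors = [(predict(xs, b, h) - y) ** 2
                  for xs, y in zip(val_sets, val_targets)]
        mse = sum(errors) / len(errors)
        if best is None or mse < best[0]:
            best = (mse, b, h)
    return best

# Toy check with a stand-in predictor that ignores the data.
toy_predict = lambda xs, b, h: b + h
best_mse, best_b, best_h = select_bandwidths(toy_predict, [[0.0]], [1.0])
```

In a real run `predict` would wrap the distribution regression estimator fitted on the training sets; the toy predictor is only there to make the example self-contained.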

In the next experiment, our task was to learn the entropy of
Gaussian distributions. We chose a $2 \times 2$ covariance
matrix $\Sigma = A A^T$, where $A \in \mathbb{R}^{2 \times 2}$,
and $A_{ij}$ was randomly selected from $U[0, 1]$. Just as in
the previous experiment, we constructed 325 sample sets from
$\{\mathcal{N}(0, R(\alpha_i) \Sigma^{1/2})\}_{i=1}^{325}$,
where $R(\alpha_i)$ is a 2d rotation matrix with rotation angle
$\alpha_i = i\pi/325$. From each $\mathcal{N}(0, R(\alpha_i)
\Sigma^{1/2})$ distribution we sampled 500 2-dimensional i.i.d.
points. Similarly to the previous experiment, 250 sample sets
were used for training, 25 for selecting appropriate bandwidth
parameters, and 50 for testing. Our goal was to learn the
entropy of the first marginal distribution: $f = \frac{1}{2}
\ln(2\pi e \sigma^2)$, where $\sigma^2 = M_{1,1}$ and $M =
R(\alpha_i) \Sigma R^T(\alpha_i) \in \mathbb{R}^{2 \times 2}$.
$\mu$ was zero in this experiment as well. Figure 2(b) displays
the learned entropies of the 50 test sample sets. The true and
the estimated values are close to each other in this experiment
as well.
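The regression target of this second experiment can be computed in closed form from the rotated covariance. The sketch below uses hypothetical function names and takes the covariance of the rotated Gaussian to be $M = R(\alpha) \Sigma R^T(\alpha)$, as in the definition of $\sigma^2$ above.

```python
import math

def rotation(alpha):
    """2-D rotation matrix for the angle alpha (radians)."""
    c, s = math.cos(alpha), math.sin(alpha)
    return [[c, -s], [s, c]]

def marginal_entropy(sigma, alpha):
    """Entropy 0.5 * ln(2 * pi * e * M_11) of the first coordinate,
    where M = R(alpha) * sigma * R(alpha)^T."""
    r0 = rotation(alpha)[0]   # first row of R
    var = sum(r0[i] * sigma[i][j] * r0[j] for i in range(2) for j in range(2))
    return 0.5 * math.log(2.0 * math.pi * math.e * var)

# Targets for the 325 rotation angles alpha_i = i * pi / 325.
sigma = [[4.0, 0.0], [0.0, 1.0]]   # example covariance, chosen for illustration
targets = [marginal_entropy(sigma, i * math.pi / 325) for i in range(1, 326)]
```

For a diagonal covariance the target sweeps smoothly between the entropies of the two axis-aligned marginals as the angle goes from 0 to $\pi$.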
Figure 2: (a) Learned skewness of the $\mathrm{Beta}(a, 3)$
distribution. Axis x: parameter $a$ in $[3, 20]$. Axis y:
skewness of $\mathrm{Beta}(a, 3)$. (b) Learned entropy of a 1d
marginal distribution of a rotated 2d Gaussian distribution.
Axis x: rotation angle in $[0, \pi]$. Axis y: entropy.



8 Discussion and Conclusion 

We have presented an estimator for distribution regression which
is distribution-free in the sense that the estimator makes no
strong distributional assumptions on the error variables. We
derived upper bounds on the risk of the estimator and, in
particular, we analyzed the case with a finite doubling
dimension.

We note that our rates are faster than the logarithmic rates
that are sometimes obtained in measurement error nonparametric
regression models, as in Fan and Truong (1993). The reason is
that the logarithmic rates occur when the measurement error is
Gaussian. Our measurement error corresponds to $\|\hat p_i -
p_i\|$, which is not Gaussian for finite $n_i$ and which
decreases when $n_i$ increases. In the standard measurement
error model, the error is $O(1)$ and is not decreasing.

In future work, we will prove lower bounds which show that,
without further assumptions (such as assumptions about the
doubling dimension), the rates can be very slow. Also, we will
show that similar results hold for other estimators such as k-NN
estimators and RKHS estimators.



Barnabas Poczos, Alessandro Rinaldo, Aarti Singh, Larry Wasserman 



References 

R.J. Carroll, D. Ruppert, L.A. Stefanski, and C.M. Crainiceanu.
Measurement error in nonlinear models: a modern perspective,
volume 105. Chapman & Hall/CRC, 2006.

A. Christmann and I. Steinwart. Universal kernels on
non-standard input spaces. In NIPS, pages 406-414, 2010.

L. Devroye and G. Lugosi. Combinatorial methods in density
estimation. Springer, 2001.

J. Fan and Y.K. Truong. Nonparametric regression with errors in
variables. The Annals of Statistics, pages 1900-1925, 1993.

F. Ferraty and P. Vieu. Nonparametric Functional Data Analysis:
Theory and Practice. Springer Verlag, 2006.

L. Gyorfi, M. Kohler, A. Krzyzak, and H. Walk. A
Distribution-Free Theory of Nonparametric Regression. Springer,
New York, 2002.

T. Jaakkola and D. Haussler. Exploiting generative models in
discriminative classifiers. In NIPS, pages 487-493. MIT Press,
1998.

T. Jebara, R. Kondor, A. Howard, K. Bennett, and N.
Cesa-Bianchi. Probability product kernels. JMLR, 5:819-844,
2004.

R. Kondor and T. Jebara. A kernel between sets of vectors. In
ICML, 2003.

S. Kpotufe. k-NN regression adapts to local intrinsic dimension.
arXiv preprint arXiv:1110.4300, 2011.

P. Moreno, P. Ho, and N. Vasconcelos. A Kullback-Leibler
divergence based kernel for SVM classification in multimedia
applications. In NIPS, 2004.

K. Muandet, B. Scholkopf, K. Fukumizu, and F. Dinuzzo. Learning
from distributions via support measure machines. arXiv.org,
stat.ML, February 2012.

B. Poczos, L. Xiong, and J. Schneider. Nonparametric divergence
estimation with applications to machine learning on
distributions. In UAI, 2011.

B. Poczos, L. Xiong, D. Sutherland, and J. Schneider.
Nonparametric kernel estimators for image classification. In
Computer Vision and Pattern Recognition, 2012.

J.O. Ramsay and B.W. Silverman. Functional data analysis.
Springer, New York, 2nd edition, 2005.

P. Rigollet and R. Vert. Optimal rates for plug-in estimators of
density level sets. Bernoulli, 15(4):1154-1178, 2009.

A. Smola, A. Gretton, L. Song, and B. Scholkopf. A Hilbert space
embedding for distributions. In ALT, 2007.

A.B. Tsybakov. Introduction to Nonparametric Estimation.
Springer, 2010.



Distribution-Free Distribution Regression



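Supplementary material

Proof of Lemma 4

Proof. The proof follows the argument of Gyorfi et al. (2002).

$$ \mathbb{P}\left( \sum_{i=1}^m K_i < \underline{K} \right) = \mathbb{P}\left( \sum_{i=1}^m K\left( \frac{D(P_i, P)}{h} \right) < \underline{K} \right) \le \mathbb{P}\left( \sum_{i=1}^m I_{\{D(P_i, P) \le rh\}} = 0 \right), $$

since according to our assumptions on the kernel $K$, if for
some $i$ it holds that $D(P_i, P)/h \le r$, then $K_i \ge
\underline{K}$. Therefore,

$$ \mathbb{P}\left( \sum_{i=1}^m I_{\{D(P_i, P) \le rh\}} = 0 \right) = \mathbb{E}\left[ \mathbb{P}\left( \sum_{i=1}^m I_{\{D(P_i, P) \le rh\}} = 0 \,\Big|\, P \right) \right] $$
$$ = \int \left[ 1 - \mathcal{P}(P_1 \in B(P, rh) \mid P) \right]^m d\mathcal{P}(P) \tag{15} $$
$$ \le \int \exp\left[ -m\, \mathcal{P}(P_1 \in B(P, rh) \mid P) \right] d\mathcal{P}(P) \tag{16} $$
$$ = \int \exp\left[ -m\, \mathcal{P}(P_1 \in B(P, rh) \mid P) \right] \frac{m\, \mathcal{P}(P_1 \in B(P, rh) \mid P)}{m\, \mathcal{P}(P_1 \in B(P, rh) \mid P)}\, d\mathcal{P}(P) $$
$$ \le \max_u \left( u \exp(-u) \right) \int \frac{d\mathcal{P}(P)}{m\, \mathcal{P}(P_1 \in B(P, rh) \mid P)} \tag{17} $$
$$ = \frac{1}{e} \int \frac{d\mathcal{P}(P)}{m\, \mathcal{P}(P_1 \in B(P, rh) \mid P)} = \frac{1}{em}\, \mathbb{E}\left[ \frac{1}{\Phi_P(rh)} \right], $$

where we used in (15), (16), and (17) respectively that
$\{P_i\}$ are i.i.d., that $(1 - u)^m \le \exp(-um)$ for all $0
\le u \le 1$, $m \ge 1$, and that $\max_u (u \exp(-u)) =
\frac{1}{e}$. □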



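Proof of Lemma 5

Proof.

$$ \mathbb{E}\left[ \frac{I_{\{\sum_i K_i \ge \underline{K}\}}}{\sum_i K_i} \right] \le \mathbb{E}\left[ \frac{1 + 1/\underline{K}}{1 + \sum_i K_i} \right] \le \mathbb{E}\left[ \frac{1 + 1/\underline{K}}{1 + \underline{K} \sum_i I_{\{D(P_i, P) \le hR\}}} \right] $$
$$ \le \frac{1 + 1/\underline{K}}{\underline{K}}\, \mathbb{E}\left[ \frac{1}{1 + \sum_i I_{\{D(P_i, P) \le hR\}}} \right] \le \frac{1 + 1/\underline{K}}{m \underline{K}}\, \mathbb{E}\left[ \frac{1}{\Phi_P(Rh)} \right], $$

where the second-to-last inequality uses the fact that
$\underline{K} < 1$ and the last inequality follows since for a
binomial random variable $B(m, p)$, $\mathbb{E}\left[
\frac{1}{1 + B(m, p)} \right] \le \frac{1}{(m + 1) p} \le
\frac{1}{mp}$. □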

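Proof of Lemma 6

Proof. $D(P, Q)$ is a distance, therefore the triangle
inequality holds, and we have that

$$ |\epsilon_i| = |\hat K_i - K_i| = \left| K\left( \frac{D(\hat P, \hat P_i)}{h} \right) - K\left( \frac{D(P, P_i)}{h} \right) \right| \le \frac{L_K}{h} \left| D(\hat P, \hat P_i) - D(P, P_i) \right| \le \frac{L_K}{h} \left( D(\hat P, P) + D(\hat P_i, P_i) \right). $$

Here we used that

$$ D(\hat P, \hat P_i) - D(P, P_i) \le \left[ D(\hat P, P) + D(P, P_i) + D(P_i, \hat P_i) \right] - D(P, P_i) = D(\hat P, P) + D(\hat P_i, P_i), $$

and

$$ D(P, P_i) - D(\hat P, \hat P_i) \le \left[ D(P, \hat P) + D(\hat P, \hat P_i) + D(\hat P_i, P_i) \right] - D(\hat P, \hat P_i) = D(\hat P, P) + D(\hat P_i, P_i). $$

□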

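Proof of Lemma 7

Proof. From Markov's inequality, for any $X$, $Y$ and constant
$\kappa > 0$,

$$ 1 \le \frac{\mathbb{E}[|X| \mid Y]}{\kappa} + \mathbb{P}(|X| \le \kappa \mid Y). $$

Thus,

$$ \mathbb{P}\left( \sum_i |\epsilon_i| \le \kappa \,\Big|\, \{P_i\}_{i=1}^m, P \right) \ge 1 - \frac{\mathbb{E}\left[ \sum_i |\epsilon_i| \,\big|\, \{P_i\}_{i=1}^m, P \right]}{\kappa} \tag{18} $$
$$ \ge 1 - \frac{L_K}{h \kappa}\, \mathbb{E}\left[ \sum_i \left( D(\hat P, P) + D(\hat P_i, P_i) \right) \,\Big|\, \{P_i\}_{i=1}^m, P \right] \tag{19} $$
$$ \ge 1 - \frac{L_K}{h \kappa}\, m\, 2 C n^{-\frac{1}{2+k}} = \eta(\kappa, n, m). $$

Here (19) holds due to Lemma 6, and we also used Lemma 2. □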
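Proof of Lemma 8

Proof. By Lemma 6, the term $\mathbb{E}\left[ \sum_{i=1}^m
|\epsilon_i| \,\big|\, P, \{P_i\}_{i=1}^m \right]$ is upper
bounded by

$$ \frac{L_K}{h} \sum_{i=1}^m \mathbb{E}\left[ D(\hat P, P) + D(\hat P_i, P_i) \,\Big|\, P, \{P_i\}_{i=1}^m \right], $$

and the claim follows from Lemma 2. □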



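Proof of Lemma 9

Proof. Recall that $D(\hat P_i, P_i) \le rh/4$ for all $i$ on an
event $\Omega_{m,n}$ and that $\mathbb{P}(\Omega_{m,n}^c) \le (m
+ 1) e^{-\frac{1}{2} n^{\frac{k}{2+k}}}$. So, on $\Omega_{m,n}$,

$$ D(\hat P_i, \hat P) \le D(\hat P_i, P_i) + D(\hat P, P) + D(P_i, P) \le D(P_i, P) + \frac{rh}{2}. $$

Now, using the event $\Omega_{m,n}$ defined in Lemma 3,

$$ \mathbb{P}\Big( \sum_{i=1}^m \hat K_i = 0 \Big) \le \mathbb{P}\Big( \sum_{i=1}^m \hat K_i < \underline{K} \Big) = \mathbb{P}\Big( \Omega_{m,n}, \sum_{i=1}^m \hat K_i < \underline{K} \Big) + \mathbb{P}\Big( \Omega_{m,n}^c, \sum_{i=1}^m \hat K_i < \underline{K} \Big) $$
$$ \le \mathbb{P}\Big( \Omega_{m,n}, \sum_{i=1}^m \hat K_i < \underline{K} \Big) + \mathbb{P}(\Omega_{m,n}^c) \le \mathbb{P}\Big( \Omega_{m,n}, \sum_{i=1}^m \hat K_i < \underline{K} \Big) + (m + 1)\, e^{-\frac{1}{2} n^{\frac{k}{2+k}}}, $$

and

$$ \mathbb{P}\Big( \Omega_{m,n}, \sum_{i=1}^m \hat K_i < \underline{K} \Big) \le \mathbb{P}\Big( \Omega_{m,n}, \sum_{i=1}^m I_{\{D(\hat P_i, \hat P) \le rh\}} = 0 \Big) \le \mathbb{P}\Big( \sum_{i=1}^m I_{\{D(P_i, P) \le rh/2\}} = 0 \Big) \le \frac{1}{em}\, \mathbb{E}\left[ \frac{1}{\Phi_P(rh/2)} \right]. $$

The result follows. □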

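Proof of Lemma 10

Proof. On $E_2 \cap \hat E_2$ both $\sum_i K_i$ and $\sum_i \hat
K_i$ are positive, so

$$ \mathbb{E}\left[ \Delta f\, I_{E_2} I_{\hat E_2} \right] = \mathbb{E}\left[ \left| \frac{\sum_i Y_i \hat K_i}{\sum_i \hat K_i} - \frac{\sum_i Y_i K_i}{\sum_i K_i} \right| I_{E_2} I_{\hat E_2} \right] $$
$$ = \mathbb{E}\left[ \left| \frac{(\sum_i Y_i \hat K_i)(\sum_j K_j) - (\sum_i Y_i K_i)(\sum_j \hat K_j)}{(\sum_i \hat K_i)(\sum_i K_i)} \right| I_{E_2} I_{\hat E_2} \right] $$
$$ \le 2 B_Y\, \mathbb{E}\left[ \frac{\sum_i |\epsilon_i|}{\sum_i K_i}\, I_{E_2} \right] = 2 B_Y\, \mathbb{E}\left[ \frac{I_{E_2}}{\sum_i K_i}\, \mathbb{E}\Big[ \sum_i |\epsilon_i| \,\Big|\, P, \{P_i\}_{i=1}^m \Big] \right] $$
$$ \le 2 B_Y\, \frac{2 L_K C m}{h}\, n^{-\frac{1}{2+k}}\, \mathbb{E}\left[ \frac{I_{E_2}}{\sum_i K_i} \right] \le 2 B_Y\, \frac{2 L_K C m}{h}\, n^{-\frac{1}{2+k}}\, \frac{1 + 1/\underline{K}}{m \underline{K}}\, \mathbb{E}\left[ \frac{1}{\Phi_P(Rh)} \right] $$
$$ = \frac{c_4}{h}\, \mathbb{E}\left[ \frac{1}{\Phi_P(Rh)} \right] n^{-\frac{1}{2+k}}, $$

where we used Lemma 8 and Lemma 5. □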

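Proof of Theorem 11

Proof.

$$ \mathbb{E}|\hat f(\hat P; \hat P_1, \ldots, \hat P_m) - \hat f(P; P_1, \ldots, P_m)| \le \frac{C_1}{h}\, \mathbb{E}\left[ \frac{1}{\Phi_P(Rh)} \right] n^{-\frac{1}{2+k}} + 3 B_Y \zeta + 3\, \frac{B_Y}{em}\, \mathbb{E}\left[ \frac{1}{\Phi_P(rh)} \right] $$
$$ \le \frac{C_1}{h}\, \mathbb{E}\left[ \frac{1}{\Phi_P(rh/2)} \right] n^{-\frac{1}{2+k}} + \frac{C_2}{m}\, \mathbb{E}\left[ \frac{1}{\Phi_P(rh/2)} \right] + (m + 1)\, e^{-\frac{1}{2} n^{\frac{k}{2+k}}}. $$

Note also that $\Phi_P(rh/2) \le \Phi_P(rh) \le \Phi_P(Rh)$. □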

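Proof of Lemma 12

Proof. Notice that if $\sum_i K_i \ge \underline{K} > 0$,

$$ \mathrm{var}\left( \frac{\sum_i \mu_i K_i}{\sum_i K_i} \,\Big|\, P, P_1, \ldots, P_m \right) \le B_Y^2\, \frac{\sum_i K_i^2}{(\sum_i K_i)^2} \le \frac{B_Y^2}{\sum_i K_i}. $$

Using this and Hölder's inequality, we get:

$$ \mathbb{E}\left[ \left| \frac{\sum_i \mu_i K_i}{\sum_i K_i} \right| I_{\{\sum_i K_i \ge \underline{K}\}} \right] = \mathbb{E}\left[ \mathbb{E}\left[ \left| \frac{\sum_i \mu_i K_i}{\sum_i K_i} \right| \,\Big|\, P, P_1, \ldots, P_m \right] I_{\{\sum_i K_i \ge \underline{K}\}} \right] $$
$$ \le \mathbb{E}\left[ \sqrt{ \mathrm{var}\left( \frac{\sum_i \mu_i K_i}{\sum_i K_i} \,\Big|\, P, P_1, \ldots, P_m \right) }\; I_{\{\sum_i K_i \ge \underline{K}\}} \right] \le B_Y \sqrt{ \mathbb{E}\left[ \frac{I_{\{\sum_i K_i \ge \underline{K}\}}}{\sum_i K_i} \right] } $$
$$ \le B_Y \sqrt{ \frac{1 + 1/\underline{K}}{m \underline{K}} \int \frac{d\mathcal{P}(P)}{\Phi_P(Rh)} }. $$

The second inequality in the variance bound holds since $K(x)
\le 1$, and the last step stems from Lemma 5. □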



